spatialHeatmap 2.1.3
The primary utility of the spatialHeatmap package is the generation of spatial heatmaps (SHM) for visualizing cell-, tissue- and organ-specific abundance patterns of biological molecules in anatomical images. This includes the identification of genes with spatially enriched (SE) expression patterns as well as clusters and/or network modules composed of genes sharing similar expression patterns. These functionalities are described in the main vignette of the spatialHeatmap package.
The following describes extended functionalities for integrating bulk tissue
with single cell data by co-visualizing them in composite plots that combine
spatial heatmaps with embedding plots of high-dimensional data. The main input
data required for this co-visualization are quantitative assay data and
anatomic images. The assay data can be provided as data.frame,
SummarizedExperiment (SE) or SingleCellExperiment (SCE) objects, while the
anatomic images need to be supplied as annotated SVGs (aSVGs). The creation of
aSVGs is described in the main vignette of this package. For the embedding
plots of single cell data, several dimensionality reduction algorithms (e.g.
PCA, UMAP or tSNE) are supported. To associate single cells with their source
tissues, the user can choose among three major methods including annotation,
manual and automatic. Like most functionalities in spatialHeatmap, these
functionalities are available from within R as well as the corresponding Shiny
app (Chang et al. 2021, @shinydashboard).
The spatialHeatmap package can be installed with the BiocManager::install command.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("spatialHeatmap")
Next, the packages required for running the sample code in this vignette need to be loaded.
library(spatialHeatmap); library(SummarizedExperiment); library(scran); library(scater); library(igraph); library(SingleCellExperiment); library(BiocParallel)
The following lists the vignette(s) of this package in an HTML browser. Clicking the name of the corresponding vignette will open it.
browseVignettes('spatialHeatmap')
To reduce runtime, intermediate results can be cached under ~/.cache/shm.
cache.pa <- '~/.cache/shm' # Set path of the cache directory
To connect cell types with the corresponding tissues, the following describes three main methods that are available to the user, including annotation-based, manual and automatic.
The annotation method allows to provide known cell-to-tissue annotations. These
cell labels are stored in the label column of the colData slot in an SCE object.
To populate this column from R, the cell labels can be provided in form of a list, while
mouse actions can be used in the Shiny App. [ThG:
the Shiny part makes no sense in this annotation section since mouse actions would fall
under manual mode. There must be an upload option in the app.]
The following example uses single cell data from oligodendrocytes of mouse brain (Marques et al. 2016). Prior to co-visualizing, the single cell data are pre-processed with standard QC and normalization methods. The details of these steps are described in the OSCA tutorial.
To obtain reproducible results, a fixed seed is set for generating random numbers.
set.seed(10)
The single cell sample data set can be imported as follows.
sce.pa <- system.file("extdata/shinyApp/example", "sce_manual_mouse.rds", package="spatialHeatmap")
sce <- readRDS(sce.pa)
The QC, normalization, and dimensionality reduction steps can each be performed with a single command. Additional details for each of these steps are provided in the Supplement section. [ThG: you need to at least briefly state what algorithms/method are used in each step.]
sce.qc <- qc_cell(sce, qc.metric=list(subsets=list(Mt=rowData(sce)$featureType=='mito'), threshold=1), qc.filter=list(nmads=3))
sce.norm <- norm_cell(sce=sce.qc, quick.clus=list(min.size = 100), com.sum.fct=list(max.cluster.size = 3000), log.norm=list())
The following command returns the cell annotation information from the label
column of the colData component of the SingleCellExperiment object, here
sce.dimred.
unique(colData(sce.dimred)$label)
## [1] "hypothalamus" "SN.VTA" "dorsal.horn" "cortex.S1" "Amygdala" "corpus.callosum" "zona.incerta" "striatum" "hippocampus.CA1" "dentate.gyrus"
The following embeds the high-dimensional gene expression data into a 2-3
dimensional space using PCA, tSNE and UMAP. All three embedding result sets
are stored in the SingleCellExperiment object. Subsequently, the UMAP result is
visualized as an example in form of a scatter plot where the dots are colored by
the corresponding cell labels.
sce.dimred <- reduce_dim(sce=sce.norm, prop=0.1, min.dim=13, max.dim=50, model.var=list(), top.hvg=list(n = 3000), de.pca=list(assay.type = "logcounts"), pca=FALSE, tsne=list(dimred="PCA", ncomponents=2), umap=list(dimred="PCA"))
plotUMAP(sce.dimred, colour_by="label")
Figure 1: Embedding plot single cells
The cells are colored by labels in the label column in colData.
Aggregate cells by annnotations defined in the label column.
[ThG: First, the previous is an incomplete sentence that should only be used in a title.
Second, This section is confusing. What do you mean by ‘aggregate’? Aren’t you using
the existing annotations to compute a summary statistics for each predefined group of
cells belonging to a tissue specified by the annotations? Subsequently, the summary
values are used to define the colors in the spatial heatmap, correct? If so try to
outline this more clearly. The same applies ot the help file of aggr_rep.
sce.aggr <- aggr_rep(sce.dimred, assay.na='logcounts', sam.factor='label', con.factor='expVar', aggr='mean')
Next, the spatial features are extracted with the return_feature function
from an aSVG image of mouse brain. This sample aSVG image file is provided by
the package.
svg.mus.brain <- system.file("extdata/shinyApp/example", "mus_musculus.brain.svg", package="spatialHeatmap")
feature.df <- return_feature(svg.path=svg.mus.brain)
## Accessing features...
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## EFO_0000530
##
## mus_musculus.brain.svg,
feature.df$feature
## [1] "path16294" "path16312"
## [3] "path16316" "path5402"
## [5] "medulla.oblongata" "cerebral.cortex"
## [7] "corpus.striatum" "diencephalon"
## [9] "pituitary.gland" "hippocampus"
## [11] "cerebellum" "brainstem"
## [13] "midbrain" "dorsal.plus.ventral.thalamus"
## [15] "hypothalamus" "nose"
## [17] "corpora.quadrigemina"
Create a named list to match cells and aSVG features. [ThG:
Why is do you select here only one feature/tissue instead of all annotated features.
Also the need/utility of this list is unclear.
lis.match <- list(hypothalamus=c('hypothalamus'))
Co-visualization on gene Cops5.
colData(sce.aggr)[, c('label', 'expVar')]
## DataFrame with 11 rows and 2 columns
## label expVar
## <character> <character>
## hypothalamus__control hypothalamus control
## SN.VTA__control SN.VTA control
## dorsal.horn__control dorsal.horn control
## cortex.S1__control cortex.S1 control
## Amygdala__control Amygdala control
## corpus.callosum__control corpus.callosum control
## zona.incerta__control zona.incerta control
## hypothalamus__6h.post.stress hypothalamus 6h.post.stress
## striatum__control striatum control
## hippocampus.CA1__control hippocampus.CA1 control
## dentate.gyrus__control dentate.gyrus control
[ThG: there needs to be a detailed explanation or legend explaining what is shown on the plot. Again coloring only one tissue creates the impression that tissues need to be depicted one by one instead of a single plot. I would change this example to highlight most tissues. Most genes are not so selectively expressed to show up in only one tissue, so choosing here an example where this is the case shouldn’t be hard.
shm.lis <- spatial_hm(svg.path=svg.mus.brain, data=sce.aggr, ID=c('Cops5'), height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=2, sce.dimred=sce.dimred, dimred='PCA', cell.group='label', assay.na='logcounts', tar.cell=c('matched'), lis.rematch=lis.match, bar.width=0.1, dim.lgd.nrow=1)
## Coordinates: mus_musculus.brain.svg ...
## CPU cores: 1
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## EFO_0000530
##
## Features in data not mapped: SN.VTA, dorsal.horn, cortex.S1, Amygdala, corpus.callosum, zona.incerta, striatum, hippocampus.CA1, dentate.gyrus
## ggplots/grobs: mus_musculus.brain.svg ...
## ggplot: Cops5, control 6h.post.stress
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## Cops5_control_1 Cops5_6h.post.stress_1
## Converting "ggplot" to "grob" ...
##
## Converting "ggplot" to "grob" ...
## dim_Cops5_control_1 dim_Cops5_6h.post.stress_1
Figure 2: Co-visualizing single cells and bulk tissues by custom cell labels
The expression profiles of gene Cops5 are used.
In the manual method, cell cluster labels are manually created through clustering algorithms. The same single cell data and aSVG file as in the annotation method are used to demonstrate the manual method. The steps of quality control, normalization, and dimensionality reduction are the same with the annotation method, while a subsequent clustering step is involved to created cell labels.
[ThG: I don’t understand why the manual method uses clustering??? If clustering needs to be used here then it is unclear what clustering algorithim is applied.]
The dimensionality reduction and clustering steps are wrapped in the same
function cluster_cell. Individual steps are seen in the Supplementary
Section.
[ThG: why does the embedding need to be rerun when it was generated before???]
sce.clus <- read_cache(cache.pa, 'sce.clus')
if (is.null(sce.clus)) {
sce.clus <- cluster_cell(data=sce.norm, prop=0.1, min.dim=13, max.dim=50, pca=FALSE, graph.meth='knn', dimred='PCA', model.var=list(), top.hvg=list(n = 3000), de.pca=list(assay.type = "logcounts"), tsne=list(dimred="PCA", ncomponents=2), umap=list(dimred="PCA"), knn.gr=list(), snn.gr=list(), cluster.wk=list(steps = 4))
save_cache(dir=cache.pa, overwrite=TRUE, sce.clus)
}
The cluster labels are exclusively saved in the cluster column in the
colData slot.
table(colData(sce.clus)$cluster)
##
## clus1 clus2 clus3 clus4 clus5 clus6 clus7 clus8 clus9
## 60 62 68 33 30 30 16 25 18
Embedding plot of single cells colored by clusters in the cluster column in
colData.
plotUMAP(sce.clus, colour_by="cluster")
Figure 3: Embedding plot single cells
The cells are labeled by clusters in the cluster column in colData.
The spatial features in mouse brain aSVG are extracted. They are the bulk tissues to be matched with cell clusters.
svg.mus.brain <- system.file("extdata/shinyApp/example", "mus_musculus.brain.svg", package="spatialHeatmap")
# Spatial features to match with single cell clusters.
feature.df <- return_feature(svg.path=svg.mus.brain)
## Accessing features...
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## EFO_0000530
##
## mus_musculus.brain.svg,
feature.df$feature
## [1] "path16294" "path16312" "path16316" "path5402" "medulla.oblongata" "cerebral.cortex"
## [7] "corpus.striatum" "diencephalon" "pituitary.gland" "hippocampus" "cerebellum" "brainstem"
## [13] "midbrain" "dorsal.plus.ventral.thalamus" "hypothalamus" "nose" "corpora.quadrigemina"
Create a named list to match cell cluster clus1 with aSVG feature
hypothalamus, and cluster clus3 with cerebral.cortex and midbrain.
lis.match.clus <- list('clus1'=c('hypothalamus'), 'clus3'=c('cerebral.cortex', 'midbrain'))
Aggregate cells by clusters defined in the cluster column.
[ThG: same problem as before the aggregate part is unclear.]
sce.clus.aggr <- aggr_rep(sce.clus, assay.na='logcounts', sam.factor='cluster', con.factor=NULL, aggr='mean')
Co-visualization of gene Cops5.
shm.lis <- spatial_hm(svg.path=svg.mus.brain, data=sce.clus.aggr, ID=c('Cops5'), height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=3, sce.dimred=sce.clus, dimred='PCA', cell.group='cluster', assay.na='logcounts', tar.cell=c('matched'), lis.rematch=lis.match.clus, bar.width=0.11, dim.lgd.nrow=1)
## Coordinates: mus_musculus.brain.svg ...
## CPU cores: 1
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## EFO_0000530
##
## Features in data not mapped: clus6, clus3, clus5, clus1, clus2, clus4, clus7, clus8, clus9
## ggplots/grobs: mus_musculus.brain.svg ...
## ggplot: Cops5, con
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## Cops5_con_1
## Converting "ggplot" to "grob" ...
##
## Converting "ggplot" to "grob" ...
## dim_Cops5_con_1
Figure 4: Co-visualizing single cells and bulk tissues by manually-detected cell clusters
The expression profiles of gene Cops5 are used.
[ThG: I don’t understand why this section is called ‘Manual’?]
In addition to the annatation and manual methods, an automatic method is provided that matches cells and bulk tissues for co-visualization. The automatic process is carried out by combining and co-clustering bulk and single cell data. The bulk and single cell data need to be derived from the same organ or tissue. If same organ, the single cell data are assayed on the whole organ. [ThG: why is that a requirement? You need to better explain what the requirements are here.] By contrast, if single cell data are from a whole tissue, the bulk tissues should be sub-tissues. To obtain optimal default settings for the automatic method, utilities for optimizing the automatic method are developed in spatialHeatmap. This section showcases application of the automatic method on mouse brain data with default settings obtained through optimization. The optimization on toy data and real data are explained in the Supplementary Section.
[ThG: would simplify and generalize this section to all three methods and move it to the introduction at the beginning. Most of what you are discribing here applies to all methods. By adjusting the illustration you can use the illustration for all 3 methods.]
Figure 5 is the automatic method (co-clustering) workflow overview. The inputs are raw count data (e.g. RNA-seq) of bulk tissues and single cells of the same organ (Figure 5.1). The single cells should come from the whole organ or at least covers the bulk tissues. The identities of each bulk tissue and each cell need to be labeled so as to evaluate the co-clustering performance. Bulk and single cell data are initially filtered at low strigency. The main difference between bulk and single cells is the sparsity in the latter. To reduce such difference, the bulk and single-cell data are combined, normalized, and then separated (Figure 5.2). After separation, the normalized bulk data are filtered to remove genes of low and constant expressions (Figure 5.3). The normalized single-cell data are also filtered to remove genes and cells having high zero-count rates. After filtered, the gene dimensionalities of single-cell data are reduced using PCA or UMAP method, and the top dimensionalities are clustered (Figure 5.4). In each cell cluster, cells having low similarities with other cells in the same cluster are filtered (Figure 5.5), and therefore the remaining clusters are more homogeneous (Figure 5.6). The filtered bulk and filtered single cells are combined and co-clustered (Figure 5.7).
The results include three types of co-clusters: 1) Two bulk tissues are clustered with cells. The source bulk is assigned to each cell according to Spearman’s correlation coefficient. For example, in Figure 5.8a bulk A is assigned to cell a1 because a1 has higher similarity with A than B. Since the true source bulk of a1 is A, this assignment is TRUE. By contrast, cell b1 also has higher similarity with A than B, and A is assigned to b1, but this assignment is labeled FALSE since the true source bulk of b1 is B; 2). Only one bulk tissue is clustered with cells. This bulk is assigned to all the cells in the same co-cluster (5.8b); and 3) No bulk is included. All these cells are discarded (Figure 5.8c). Lastly, the Spearman’s correlation coefficient and TRUE or FALSE assignments are used to create ROC plots and evaluate the performance (Figure 5.9). [ThG: ROCs can only be generated if the correct assingments are known. Readers will not understand why this is mentioned here and how they would use the tool since they don’t have the true tissue assignments.]
[ThG: also note the text in the above two paragraph needs major revisions with respect to grammar and clarity.]
Figure 5: Overview of coclustering
The coclustering are illustrated in 9 steps. Basically, bulk and single data are initially filtered, then combined, normalized, and separated. The normalized bulk and single cell data are filtered again. Single cells are clustered and resulting clusters are refined. Subsequently bulk and single cells are combined and co-clustered. In each co-cluster, bulk tissues are assigned to cells. The assignments are evaluated by ROC curves.
This section demonstrates co-visulization of bulk and single cells using the automatic method on mouse brain data. The bulk RNA-seq data are generated in a research on the impact of placental endocrine on mouse cerebellar development (Vacher et al. 2021) and the scRNA-seq data are from a study of mouse brain molecular atlas (Ortiz et al. 2020). Both bulk and single cell data sets are reduced for demonstration purpose. The optimal settings of fct, fil1, pca, knn, s0.2p0.8d13 in Table 2 are taken as default settings and used in this section.
To obtain reproducible results, always start a new R session and set a fixed seed for Random Number Generator at the beginning, which is required only once in each R session.
set.seed(10)
Read bulk and single cell data. The bulk tissues and single cells are labeled with known identity so that the perforance of auto method can be evaluated. The same tissue or cell type should have the same label such as isocort. Overlaps between between bulk tissue and single cells labels are not allowed.
# Example bulk data.
blk.mus.pa <- system.file("extdata/shinyApp/example", "bulk_mouse_cocluster.txt", package="spatialHeatmap")
blk.mus <- as.matrix(read.table(blk.mus.pa, header=TRUE, row.names=1, sep='\t', check.names=FALSE))
blk.mus[1:3, 1:5]
## CERE.CORTEX HIPP HYPOTHA CERE CERE.CORTEX
## AI593442 177 256 50 24 285
## Actr3b 513 1465 228 244 666
## Adcy1 701 1243 57 1910 836
# Example single cell data.
sc.mus.pa <- system.file("extdata/shinyApp/example", "cell_mouse_cocluster.txt", package="spatialHeatmap")
sc.mus <- as.matrix(read.table(sc.mus.pa, header=TRUE, row.names=1, sep='\t', check.names=FALSE))
sc.mus[1:3, 1:5]
## isocort isocort isocort isocort olfa
## AI593442 2 2 2 5 0
## Actr3b 3 5 4 4 1
## Adcy1 3 6 6 6 0
Initial filtering at lower strigency.
blk.mus <- filter_data(data=blk.mus, sam.factor=NULL, con.factor=NULL, pOA=c(0.1, 5), CV=c(0.2, 100), verbose=FALSE)
mus.lis <- filter_cell(lis=list(sc.mus=sc.mus), bulk=blk.mus, gen.rm=NULL, min.cnt=1, p.in.cell=0.5, p.in.gen=0.1, verbose=FALSE)
Bulk and single cell are combined and normalized, then separated.
mus.lis.nor <- read_cache(cache.pa, 'mus.lis.nor')
if (is.null(mus.lis.nor)) {
mus.lis.nor <- norm_multi(dat.lis=mus.lis, cpm=FALSE)
save_cache(dir=cache.pa, overwrite=TRUE, mus.lis.nor)
}
Secondary filtering at higher strigency.
[ThG: unnecessary complication, why not doing the filtering
right from the start?]
blk.mus.fil <- filter_data(data=logcounts(mus.lis.nor$bulk), sam.factor=NULL, con.factor=NULL, pOA=c(0.1, 0.5), CV=c(0.2, 100), verbose=FALSE)
mus.lis.fil <- filter_cell(lis=list(sc.mus=logcounts(mus.lis.nor$sc.mus)), bulk=blk.mus.fil, gen.rm=NULL, min.cnt=1, p.in.cell=0.05, p.in.gen=0.02, verbose=FALSE)
The aSVG file of mouse brain.
[ThG: again use full sentences in paragraphs.]
svg.mus.brain <- system.file("extdata/shinyApp/example", "mus_musculus.brain.svg", package="spatialHeatmap")
# Spatial features.
feature.df <- return_feature(svg.path=svg.mus.brain)
## Accessing features...
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## EFO_0000530
##
## mus_musculus.brain.svg,
The true matching between bulk tissues and single cells are defined in a table
and provided to the co-clustering workflow so that the performance can be
assessed.
[ThG: again users won’t have this. The optimization belongs
into the supplement.]
In addition, the matching between tissues and aSVG features
(SVGBulk) are defined in the same table, which is used to create the
co-visualization plot.
match.mus.brain.pa <- system.file("extdata/shinyApp/example", "match_mouse_brain_cocluster.txt", package="spatialHeatmap")
df.match.mus.brain <- read.table(match.mus.brain.pa, header=TRUE, row.names=1, sep='\t')
df.match.mus.brain
## SVGBulk cell trueBulk
## 1 cerebellum cere CERE
## 2 hippocampus hipp HIPP
## 3 cerebral.cortex isocort CERE.CORTEX
## 4 hippocampus retrohipp HIPP
## 5 hypothalamus hypotha HYPOTHA
Ensure the SVGBulk tissues are present in the aSVG file.
df.match.mus.brain$SVGBulk %in% feature.df$feature
## [1] TRUE TRUE TRUE TRUE TRUE
The processes of clustering single cells, refining cell clusters, and
co-clustering bulk and single cells (Figure 5) are
performed in a single function call. Setting return.all=TRUE returns a list
of refined cell clusters, ROC object, and a data.frame of auto-matching
results. If return.all=FALSE, a data.frame of parameter settings and
resulting AUC is returned. Details of individual steps are provided in the
Supplementary Section.
[ThG: now you are implying the ROC is a requirement? Users will stop here since they will not have it. ]
res.lis <- read_cache(cache.pa, 'res.lis')
if (is.null(res.lis)) {
res.lis <- cocluster(bulk=mus.lis.fil$bulk, cell=mus.lis.fil$sc.mus, df.match=df.match.mus.brain, df.para=NULL, sim=0.2, sim.p=0.8, dim=13, graph.meth='knn', dimred='PCA', sim.meth='spearman', return.all=TRUE, multi.core.par=MulticoreParam(workers=1, RNGseed=50), verbose=FALSE)
res.lis <- res.lis[[1]]
save_cache(dir=cache.pa, overwrite=TRUE, res.lis)
}
The co-clustering results are listed in roc.lis$df.roc. predictor is the similarity (Spearman’s or Pearson’s correlation coefficient) between bulk and cells within a co-cluster, which is used to assign bulk tissues to cells (Figure 5.8). response indicates whether the bulk assignment is correct according to the matching table. index is the cell index in the SingleCellExperiment after cell clusters are refined.
res.lis$df.roc[1:3, ]
## assignedBulk cell response predictor index trueBulk SVGBulk
## HYPOTHA HYPOTHA hypotha TRUE 0.5934066 2335 HYPOTHA hypothalamus
## HYPOTHA1 HYPOTHA hypotha TRUE 0.5054945 2380 HYPOTHA hypothalamus
## HYPOTHA2 HYPOTHA hypotha TRUE 0.7747253 2396 HYPOTHA hypothalamus
table(res.lis$df.roc$response)
##
## FALSE TRUE
## 26 639
ROC curve is created according to roc.lis$df.roc and the AUC value indicates the auto-matching performance, where higher AUC indicates better performance.
[ThG: I am lost. Why is the ROC part here? ]
plot(res.lis$roc.obj, print.auc=TRUE)
Figure 6: ROC curve of auto-matching on mouse brain data
The AUC value is a measure of accuracy on bulk assignments.
The function cocluster accepts multiple combinations of parameter settings provided in a data.frame, and co-clustering on these combinations can be performed in parallel on multiple cpu cores through multi.core.par.
Multiple combinations of parameter settings. If some parameters are not specified in this table such as graph.meth, their default settings will be used.
df.par <- data.frame(sim=c(0.2, 0.3), sim.p=c(0.8, 0.7), dim=c(12, 13))
df.par
## sim sim.p dim
## 1 0.2 0.8 12
## 2 0.3 0.7 13
The coclustering is run on 2 cpu cores (workers=2).
res.multi <- cocluster(bulk=mus.lis.fil$bulk, cell=mus.lis.fil$sc.mus, df.match=df.match.mus.brain, df.para=df.par, sc.dim.min=10, max.dim=50, sim=0.2, sim.p=0.8, dim=13, graph.meth='knn', dimred='PCA', sim.meth='spearman', return.all=TRUE, multi.core.par=MulticoreParam(workers=2, RNGseed=50), verbose=FALSE)
[ThG: users would expect to see a plot next, yet a new subsection starts now that lacks context. Is this an oversight?]
The co-clustering results can be tailored through “Lasso Select” on the
convenience Shiny app (desired_bulk_shiny) or manually defining desired bulk.
If the former, save cell.refined in an .rds file by saveRDS(cell.refined, file='cell.refined.rds') and upload cell.refined.rds to the Shiny app, where
tailoring instructions are provided.
[ThG: I don’t understand the above paragraph. The shiny part is totally out of context.]
Example of desired bulk downloaded from the convenience Shiny app. The x-y coordinates refer to single cells in embbeding plots (dimred).
desired.blk.pa <- system.file("extdata/shinyApp/example", "selected_cells_with_desired_bulk.txt", package="spatialHeatmap")
df.desired.bulk <- read.table(desired.blk.pa, header=TRUE, row.names=1, sep='\t')
df.desired.bulk[1:3, ]
## x y key desiredSVGBulk dimred
## 1 4.389422 6.917510 62 cerebellum PCA
## 2 4.586877 8.077207 75 cerebellum PCA
## 3 6.366188 7.695782 76 cerebellum PCA
If manually defining desired bulk, first visualize single cells in the embedding plot.
plot_dim(res.lis$cell.refined, dim='PCA', color.by='cell', x.break=seq(-10, 10, 2), y.break=seq(-10, 10, 2))
Figure 7: PCA embedding plot of mouse brain single cell data
Single cell data after cluster refining are plotted.
Manually define desired bulk for certain cells by x-y coordinates ranges (x.min, x.max, y.min, y.max) in the embedding plot. The dimred reveals where the coordinates come from and are required.
df.desired.bulk <- data.frame(x.min=c(-6, 6), x.max=c(-4, 10), y.min=c(6, -4), y.max=c(8, -2), desiredSVGBulk=c('cerebral.cortex', 'cerebral.cortex'), dimred='PCA')
df.desired.bulk
## x.min x.max y.min y.max desiredSVGBulk dimred
## 1 -6 -4 6 8 cerebral.cortex PCA
## 2 6 10 -4 -2 cerebral.cortex PCA
Extract cells with bulk assignments. If the df.desired.bulk argument is
assigned a value, the corrresponding assignments are incorporated in
res.lis$df.roc, and their response and predictor is set TRUE and 1
respectively. thr is a cutoff for the predictor in res.lis$df.roc, so
thr=0 denotes predictor is not filtered. true.only=TRUE indicates only
true assignments are extracted.
[ThG: lack of context; hard to follow; does this need to be here?]
sce.lis <- sub_asg(res.lis=res.lis, thr=0, df.desired.bulk=df.desired.bulk, df.match=df.match.mus.brain, true.only=TRUE)
## No cells selected for row: 1!
## No cells selected for row: 2!
Aggregate extracted cells by SVGBulk. The aggregated cells are equivalent to
bulk tissues (spatial features) in the aSVG. The aggregated abundance profiles
are used to color matching bulk tissues in the aSVG image.
[ThG: I am unable follow. Please provide clearer explantions to help readers what is done here or remove this part. The fact that you are not showing any output or visualization in this and other code sections makes it very hard to understand what is happening throughout.]
sce.aggr <- aggr_rep(data=sce.lis$cell.sub, assay.na='logcounts', sam.factor='SVGBulk', con.factor=NULL, aggr='mean')
Co-visualize bulk and single cells with aggregated abundance profiles of gene
Adcy1. tar.bulk refers to the target bulk in aSVG and all corresponding
cells are highlighted in the embedding plot. Cells with true assignments of
tar.bulk are colored according to the color key, while other corresponding
cells with false or without assignments are colored black. All other cells not
corresponding to tar.bulk are in gray. In the embedding plot, the top and
right clusters of red cells are the cells defined in df.desired.bulk.
shm.lis1 <- spatial_hm(svg.path=svg.mus.brain, data=sce.aggr, ID=c('Adcy1'), legend.nrow=4, sce.dimred=sce.lis$cell.refined, dimred='PCA', assay.na='logcounts', tar.bulk=c('cerebral.cortex'), profile=TRUE, dim.lgd.text.size=10, dim.lgd.nrow=1, bar.width=0.1)
## Coordinates: mus_musculus.brain.svg ...
## CPU cores: 1
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## EFO_0000530
##
## ggplots/grobs: mus_musculus.brain.svg ...
## ggplot: Adcy1, con
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## Adcy1_con_1
## Converting "ggplot" to "grob" ...
##
## Converting "ggplot" to "grob" ...
## dim_Adcy1_con_1
Figure 8: Co-visualizing bulk and single cells of mouse brain with abundance profiles
The aggregated expression profiles of gene Adcy1 are plotted.
Co-visualize bulk and single cells without abundance profiles.
shm.lis2 <- spatial_hm(svg.path=svg.mus.brain, data=sce.aggr, ID=c('Adcy1'), legend.nrow=4, sce.dimred=sce.lis$cell.refined, dimred='PCA', tar.bulk=c('cerebral.cortex'), profile=FALSE, dim.lgd.text.size=10, dim.lgd.nrow=1)
## Coordinates: mus_musculus.brain.svg ...
## CPU cores: 1
## Element "a" is removed: a4174 !
##
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted!
## EFO_0000530
##
## ggplots/grobs: mus_musculus.brain.svg ...
## ggplot: Adcy1, con
## Legend plot ...
## CPU cores: 1
## Converting "ggplot" to "grob" ...
## Adcy1_con_1
## Converting "ggplot" to "grob" ...
##
## Converting "ggplot" to "grob" ...
## dim_mus_musculus.brain.svg
Figure 9: Co-visualizing bulk and single cells of mouse brain without abundance profiles
The matching between cells and SVG bulk is denoted by color in-between.
When using the Shiny app, single cell data in annotation method or combined
single cell and bulk data in automatic method are saved in an .rds file of
SingleCellExperiment object by saveRDS. In addition, in the automatic
method bulk tissues and single cells are labeled by bulk and cell
respectively by the bulkCell column in colData slot. The true matching
table is stored in the metadata list with the name df.match. The example
below illustrates these rules.
[ThG: this section would need to show a screenshot showing where this is used in the shiny app.]
sce.auto <- readRDS(system.file("extdata/shinyApp/example", 'sce_auto_bulk_cell_mouse_brain.rds', package="spatialHeatmap"))
colData(sce.auto)
## DataFrame with 4466 rows and 1 column
## bulkCell
## <character>
## CERE.CORTEX bulk
## HIPP bulk
## HYPOTHA bulk
## CERE bulk
## CERE.CORTEX bulk
## ... ...
## retrohipp cell
## retrohipp cell
## retrohipp cell
## retrohipp cell
## retrohipp cell
metadata(sce.auto)$df.match
## SVGBulk cell trueBulk
## 1 cerebellum cere CERE
## 2 hippocampus hipp HIPP
## 3 cerebral.cortex isocort CERE.CORTEX
## 4 hippocampus retrohipp HIPP
## 5 hypothalamus hypotha HYPOTHA
This section presents individual steps in the annotation method. Since the results are the same with the main section, the following code is not evaluated.
To obtain reproducible results, always start a new R session and set a fixed seed for Random Number Generator at the beginning.
set.seed(10)
Read the example single cell data.
sce.pa <- system.file("extdata/shinyApp/example", "sce_manual_mouse.rds", package="spatialHeatmap")
sce <- readRDS(sce.pa)
The following routine steps in single-cell data analysis are learned from Bioconductor OCSA. Since these steps are not the focus, details are not explained.
Quality control through mitochondria and spike-in genes.
stats <- perCellQCMetrics(sce, subsets=list(Mt=rowData(sce)$featureType=='mito'), threshold=1)
sub.fields <- 'subsets_Mt_percent'
ercc <- 'ERCC' %in% altExpNames(sce)
if (ercc) sub.fields <- c('altexps_ERCC_percent', sub.fields)
qc <- perCellQCFilters(stats, sub.fields=sub.fields, nmads=3)
# Discard unreliable cells.
colSums(as.matrix(qc))
sce <- sce[, !qc$discard]
Normalization.
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, cluster=clusters)
sce <- logNormCounts(sce)
Dimensionality reduction with PCA, UMAP and tSNE.
# Variance modelling.
df.var <- modelGeneVar(sce)
top.hvgs <- getTopHVGs(df.var, prop = 0.1, n = 3000)
# Dimensionality reduction.
sce <- denoisePCA(sce, technical=df.var, subset.row=top.hvgs, min.rank=13, max.rank=50, assay.type = "logcounts")
sce <- runUMAP(sce, dimred = "PCA", ncomponents=min.dim)
sce <- runTSNE(sce, dimred="PCA")
The manual method has the same steps with the annotation method except for a subsequently clustering step. This section showcases individual steps in the clustering step.
Cell clusters are detected by first building a graph object then partitioning the graph, where cells are nodes in the graph.
# Build graph.
snn.gr <- buildSNNGraph(sce.manual, use.dimred="PCA")
plot(snn.gr)
# Partition graph to detect cell clusters.
cluster <- paste0('clus', cluster_walktrap(snn.gr)$membership)
table(cluster)
Cell cluster assignments are stored in the cluster column in colData slot of SingleCellExperiment. If there are experimental variables such as treatments or time points, they should be stored in the expVar column.
cdat <- colData(sce.manual)
lab.lgc <- 'label' %in% make.names(colnames(cdat))
if (lab.lgc) {
cdat <- cbind(cluster=cluster, colData(sce.manual))
idx <- colnames(cdat) %in% c('cluster', 'label')
cdat <- cdat[, c(which(idx), which(!idx))]
} else cdat <- cbind(cluster=cluster, colData(sce.manual))
colnames(cdat) <- make.names(colnames(cdat))
colData(sce.manual) <- cdat; cdat[1:3, ]
Since real optimizations have high demand on computing power and take a long time, it is demonstrated on toy data. Thus the result parameter settings may not be really optimal. The example bulk and single cell RNA-seq data are from Arabidopsis thaliana (Arabidopsis) root. Bulk tissue data comprise all the major root tissues such as epidermis, cortex, endodermis, xylem, columella, which are generated in a research on alternative splicing and lincRNA regulation (Li et al. 2016). The two single cell data sets are derived from the whole root, which are produced in a study of single cell Arabidopsis root atlas (Shahan et al. 2020). The identities of bulk and single cells are all labeled.
The optimization focuses on parameters of normalization methods, filtering, dimensionality reduction methods, refining homogeneous cell clusters, number of top dimensionalities in co-clustering, graph-building methods in co-clustering. The optimization is performed by running the co-clustering workflow (Figure 5) on each of the single cell data sets. The parameter settings being optimized is fixed and all settings of other parameters are varied across all possible combinations.
Each running of the workflow yields an AUC value, thus after running the workflow on all possible settings combinations one parameter settings has a set of AUC values. The AUCs are filtered according to some criteria and the remaining AUCs are averaged. A settings with a higher mean AUC than its counterparts are taken as optimal in a parameter. For example, when optimizing dimensionality reduction methods, the settings are PCA and UMAP. If the mean of remaining AUCs of PCA is 0.6 while UMAP is 0.55, PCA is regarded as the optimal.
Since optimzation on example data also takes a relatively long time, most of the following steps are not evaluated. A common computer with 4G memory and 4 CPUs is enough to run the following optimization process.
To obtain reproducible results, always start a new R session and set a fixed seed for Random Number Generator at the beginning, which is required only once in each R session.
set.seed(10)
Read bulk and two single cell data.
blk <- readRDS(system.file("extdata/cocluster/data", "bulk_cocluster.rds", package="spatialHeatmap")) # Bulk.
sc10 <- readRDS(system.file("extdata/cocluster/data", "sc10_cocluster.rds", package="spatialHeatmap")) # Single cell.
sc11 <- readRDS(system.file("extdata/cocluster/data", "sc11_cocluster.rds", package="spatialHeatmap")) # Single cell.
blk; sc10; sc11
## class: SummarizedExperiment
## dim: 2805 45
## metadata(0):
## assays(1): counts
## rownames(2805): AT1G01070 AT1G01120 ... ATCG00650 ATCG00770
## rowData names(0):
## colnames(45): PHLM_COMP PHLM_COMP ... HAIR CORT
## colData names(0):
## class: SingleCellExperiment
## dim: 577 1893
## metadata(0):
## assays(1): counts
## rownames(577): AT1G01470 AT1G02400 ... AT5G66870 AT5G67080
## rowData names(0):
## colnames(1893): atricho endo ... atricho cortex
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## class: SingleCellExperiment
## dim: 577 2364
## metadata(0):
## assays(1): counts
## rownames(577): AT1G01470 AT1G02400 ... AT5G66870 AT5G67080
## rowData names(0):
## colnames(2364): atricho tricho ... endo atricho
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
These example data are already pre-processed. To demonstrate the optimization process the pre-processing steps are perfomed again with few genes or cells removed.
Inital filtering with low strigency before normalization.
blk <- filter_data(data=blk, pOA=c(0.2, 15), CV=c(1.5, 100))
fil.init <- filter_cell(lis=list(sc10=sc10, sc11=sc11), bulk=blk, gen.rm='^ATCG|^ATCG', min.cnt=1, p.in.cell=0.3, p.in.gen=0.1); fil.init
Combine and normalize bulk and single cell data, then separate them. By default computeSumFactors (fct) in scran package is used (Lun, McCarthy, and Marioni 2016). If cpm=TRUE, additional normalization of counts per million is applied.
norm.fct <- norm_multi(dat.lis=fil.init, cpm=FALSE) # fct.
norm.cpm <- norm_multi(dat.lis=fil.init, cpm=TRUE) # fct + cpm
Secondary filtering with higher strigency after normalization. Four sets of filtering parameter settings are created. In bulk data, genes with expression values over A across samples of over proportion p and with coefficinet of variance (CV) between cv1 and cv2 are retained. In cell data, genes with expression values over min.cnt of at least proportion p.in.gen are retained, and cells with with expression values over min.cnt of at least proportion p.in.cell are retained.
df.par.fil <- data.frame(p=c(0.1, 0.2, 0.3, 0.4), A=rep(1, 4), cv1=c(0.1, 0.2, 0.3, 0.4), cv2=rep(100, 4), min.cnt=rep(1, 4), p.in.cell=c(0.1, 0.25, 0.3, 0.35), p.in.gen=c(0.01, 0.05, 0.1, 0.15))
df.par.fil
## p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100 1 0.10 0.01
## 2 0.2 1 0.2 100 1 0.25 0.05
## 3 0.3 1 0.3 100 1 0.30 0.10
## 4 0.4 1 0.4 100 1 0.35 0.15
Filter bulk and cell data using the four filtering settings. The results are automatically saved in the working directory wk.dir and are recognized in the downstream. Thus the working directory should be the same across the entire workflow.
if (!dir.exists('opt_res')) dir.create('opt_res')
fct.fil.all <- filter_iter(bulk=norm.fct$bulk, cell.lis=list(sc10=norm.fct$sc10, sc11=norm.fct$sc11), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_res', norm.meth='fct')
cpm.fil.all <- filter_iter(bulk=norm.cpm$bulk, cell.lis=list(sc10=norm.cpm$sc10, sc11=norm.cpm$sc11), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_res', norm.meth='cpm')
To evaluate the downstream auto-matching performance, a ground-truth matching relationship is required in form of data.frame. The cell and trueBulk refer to bulk tissue identifiers in aSVG files, single cell identifiers and bulk tissue identifiers in the data for co-clustering, respectively. If a cell matches multiple bulk tissues, bulk identifiers are separated by comma, semicolon, or single space such as NONHAIR,LRC_NONHAIR. The SVGBulk is the bulk identifiers in aSVG files, which are recognized in co-visualization.
match.pa <- system.file("extdata/cocluster/data", "match_arab_root_cocluster.txt", package="spatialHeatmap")
df.match.arab <- read.table(match.pa, header=TRUE, row.names=1, sep='\t')
df.match.arab[1:3, ]
## SVGBulk cell trueBulk
## 1 NONHAIR atricho NONHAIR,LRC_NONHAIR
## 2 COLU colu.dist.colu COLU
## 3 COLU colu.dist.lrc COLU
In real application, the whole optimization takes a long time and requires a lot of computation power. For example, combined bulk and cell data with 6945 genes and 7747 samples requires about 20G memory for coclustering. To speed up computation, the optimization function coclus_opt provides two levels of parallel computing through BiocParallel (Morgan et al. 2021). The first one is BatchtoolsParam and relies on a cluster scheduler such as the SLURM scheduler and the second one utilizes MulticoreParam.
Before optimzation, users could check the parallelization guide by setting parallel.info=TRUE, then it returns the max possible parallelizations for each level respectively.
coclus_opt(wk.dir='opt_res', parallel.info=TRUE, dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.4, by=0.1), sim.p=seq(0.2, 0.4, by=0.1), dim=seq(5, 7, by=1))
A SLURM template is provided as an example for the first level parallelization. Users are advised to make a new copy and set SLURM parameters in the new copy. If users have access to other cluster schedulers, the template should be provided accordingly.
file.copy(system.file("extdata/cocluster", "slurm.tmpl", package="spatialHeatmap"), './slurm.tmpl')
Below is the demonstration of two-level parallelization on SLURM. For instance, the first- and second-level parallelizations are set 3 and 2 cpu cores respectively. The wk.dir is the same in secondary filtering.
sim and sim.p are parameters in refining cell clusters (Figure 5.5). Specifically, in a cell cluster, cells having similarities over sim with other cells in the same cluster of at least proportion sim.p would remain. sim is Spearman’ or Pearson’s correlation coefficient. dim is the number of top dimensionalities (equivalent to genes) in co-clustering. Since the three parameters are related to each other, they are treated as a set spd.set.
opt <- coclus_opt(wk.dir='opt_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.4, by=0.1), sim.p=seq(0.2, 0.4, by=0.1), dim=seq(5, 7, by=1), df.match=df.match.arab, batch.par=BatchtoolsParam(workers=3, cluster="slurm", template='slurm.tmpl', RNGseed=100, stop.on.error = FALSE, log =TRUE, logdir=file.path('opt_res', 'batch_log')), multi.core.par=MulticoreParam(workers=2), verbose=FALSE)
If no cluster scheduler is available, optimization can be parallelized only at the second-level by setting batch.par=NULL.
opt <- coclus_opt(wk.dir='opt_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.4, by=0.1), sim.p=seq(0.2, 0.4, by=0.1), dim=seq(5, 7, by=1), df.match=df.match.arab, batch.par=NULL, multi.core.par=MulticoreParam(workers=2))
The performace of each combination of parameter settings on each single cell data set is measured by an AUC value in ROC curve. These AUCs are filtered according to a cutoff (aucs over 0.5) and corresponding total bulk assignments (total.min) and total true assignments (true.min). The following demonstrates how to visualize the AUCs and select optimal parameter settings.
Extract AUCs for each filtering settings across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.fil <- auc_stat(wk.dir='opt_res', tar.par='filter', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each filtering settings and AUC cutoff.
df.lis.fil$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.fil[[1]], bar.width=0.07, title='Mean AUCs by filtering settings', x.text.size=15, y.text.size=15, lgd.text.size=15)
Figure 10: Mean AUCs of filtering settings
One bar refers to mean AUCs of a filtering settings at a certain AUC cutoff.
All AUCs by each filtering settings and AUC cutoff.
auc_violin(df.lis=df.lis.fil, xlab='Filtering settings', x.text.size=13, y.text.size=13)
Figure 11: All AUCs of filtering settings
A violin plot refers to all AUCs of a filtering settings at a certain AUC cutoff.
According to the mean AUCs, optimal filtering settings are fil1, fil2, fil3.
df.par.fil[c(1, 2, 3), ]
## p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100 1 0.10 0.01
## 2 0.2 1 0.2 100 1 0.25 0.05
## 3 0.3 1 0.3 100 1 0.30 0.10
Extract AUCs for normalization methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.norm <- auc_stat(wk.dir='opt_res', tar.par='norm', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each normalization method and AUC cutoff.
df.lis.norm$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.norm[[1]], bar.width=0.07, title='Mean AUCs by normalization methods', x.text.size=15, y.text.size=15, lgd.text.size=15)
All AUCs by each normalization method and AUC cutoff.
auc_violin(df.lis=df.lis.norm, xlab='Normalization methods', x.text.size=13, y.text.size=13)
Optimal normalization method: fct (computeSumFactors).
Extract AUCs for graph-building methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.graph <- auc_stat(wk.dir='opt_res', tar.par='graph', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each graph-building method and AUC cutoff.
df.lis.graph$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.graph[[1]], bar.width=0.07, title='Mean AUCs by graph-building methods')
All AUCs by each graph-building method and AUC cutoff.
auc_violin(df.lis=df.lis.graph, xlab='Graph-building methods')
Optimal graph-building methods: knn (buildKNNGraph).
Extract AUCs for dimensionality reduction methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.dimred <- auc_stat(wk.dir='opt_res', tar.par='dimred', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each dimensionality reduction method and AUC cutoff.
df.lis.dimred$df.auc.mean[1:3, ]
# Mean AUCs by each dimensionality reduction method and AUC cutoff.
mean_auc_bar(df.lis.dimred[[1]], bar.width=0.07, title='Mean AUCs by dimensionality reduction methods')
All AUCs by each dimensionality reduction method and AUC cutoff.
auc_violin(df.lis=df.lis.dimred, xlab='Dimensionality reduction')
Optimal dimensionality reduction method: pca (denoisePCA).
Extract AUCs for spd.set across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.spd <- auc_stat(wk.dir='opt_res', tar.par='spd.set', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
df.lis.spd$auc0.5$df.frq[1:3, ]
All AUCs of top five spd.sets ranked by frequency across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
spd_auc_violin(df.lis=df.lis.spd, n=5, xlab='spd.sets', x.vjust=0.6)
Top five spd.sets across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively are taken as optimal spd.sets.
n <- 5; df.spd.opt <- NULL
for (i in df.lis.spd) {
df.spd.opt <- rbind(df.spd.opt, i$df.frq[seq_len(n), c('sim', 'sim.p', 'dim')])
}
df.spd.opt$spd.set <- paste0('s', df.spd.opt$sim, 'p', df.spd.opt$sim.p, 'd', df.spd.opt$dim)
df.spd.opt <- subset(df.spd.opt, !duplicated(spd.set))
df.spd.opt[1:3, ]
In real application, the optimized settings need to be validated on data sets from other organs of different species, which is presented below.
Ideally, the co-clustering should be optimized on different organs from different organisms as many possible. The single cell data need to be generated on whole organs and each cell’s identity need to be labeled. Such data are less common and not easy to obtain in public databases, since most single cell RNA-seq (scRNA-seq) assays only focus on specific cell populations rather than whole organs, which are isolated by microdissection or fluorescent assisted cell sorting (FACS). As a result, the co-clustering optimization is performed only on five single cell data sets of Arabidopsis thaliana (Arabidopsis) root. The optimized parameter settings are validated on mouse brain and kidney.
The optimization in real case has high demand on computing power and takes a long time, so most of the following steps are not evaluated. The following steps are not explained in details since they are the same as last section.
The bulk (Li et al. 2016) and five single cell (Shahan et al. 2020) data sets of Arabidopsis root are accessed from the same studies as last section. Details about how to access and format them are described here. In the following, blk.arb.rt refers to bulk data and sc.arab.rt10, sc.arab.rt11, sc.arab.rt12, sc.arab.rt30, sc.arab.rt31 refers to the five single cell data sets respectively.
To obtain reproducible results, always start a new R session and set a fixed seed for Random Number Generator at the beginning, which is required only once in each R session.
set.seed(10)
Inital filtering with low strigency before normalization.
blk.arab.rt <- filter_data(data=blk.arab.rt, pOA=c(0.05, 5), CV=c(0.05, 100))
fil.init <- filter_cell(lis=list(sc10=sc.arab.rt10, sc11=sc.arab.rt11, sc12=sc.arab.rt12, sc30=sc.arab.rt30, sc31=sc.arab.rt31), bulk=blk, gen.rm='^ATCG|^ATCG', min.cnt=1, p.in.cell=0.01, p.in.gen=0.05); fil.init
Combine and normalize bulk and single cell data, then separate them.
norm.fct <- norm_multi(dat.lis=fil.init, cpm=FALSE) # fct.
norm.cpm <- norm_multi(dat.lis=fil.init, cpm=TRUE) # fct + cpm
Secondary filtering with higher strigency after normalization. Four sets of filtering parameter settings are created.
df.par.fil <- data.frame(p=c(0.1, 0.2, 0.3, 0.4), A=rep(1, 4), cv1=c(0.1, 0.2, 0.3, 0.4), cv2=rep(100, 4), min.cnt=rep(1, 4), p.in.cell=c(0.1, 0.25, 0.3, 0.35), p.in.gen=c(0.01, 0.05, 0.1, 0.15))
df.par.fil
## p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100 1 0.10 0.01
## 2 0.2 1 0.2 100 1 0.25 0.05
## 3 0.3 1 0.3 100 1 0.30 0.10
## 4 0.4 1 0.4 100 1 0.35 0.15
Filter bulk and cell data using the four filtering settings. The results are automatically saved in the working directory wk.dir and are recognized in the downstream. Thus the working directory should be the same across the entire workflow.
if (!dir.exists('opt_real_res')) dir.create('opt_real_res')
fct.fil.all <- filter_iter(bulk=norm.fct$bulk, cell.lis=list(sc10=norm.fct$sc10, sc11=norm.fct$sc11, sc12=norm.fct$sc12, sc30=norm.fct$sc30, sc31=norm.fct$sc31), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_real_res', norm.meth='fct')
cpm.fil.all <- filter_iter(bulk=norm.cpm$bulk, cell.lis=list(sc10=norm.cpm$sc10, sc11=norm.cpm$sc11, sc12=norm.cpm$sc12, sc30=norm.cpm$sc30, sc31=norm.cpm$sc31), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_real_res', norm.meth='cpm')
Ground-truth matching relationship across cell, trueBulk, and SVGBulk.
match.pa <- system.file("extdata/cocluster/data", "match_arab_root_cocluster.txt", package="spatialHeatmap")
df.match.arab <- read.table(match.pa, header=TRUE, row.names=1, sep='\t')
df.match.arab[1:3, ]
## SVGBulk cell trueBulk
## 1 NONHAIR atricho NONHAIR,LRC_NONHAIR
## 2 COLU colu.dist.colu COLU
## 3 COLU colu.dist.lrc COLU
The max possible parallelizations for each level respectively.
coclus_opt(wk.dir='opt_real_res', parallel.info=TRUE, dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.8, by=0.1), sim.p=seq(0.2, 0.8, by=0.1), dim=seq(5, 40, by=1))
Take the SLURM scheduler as an example for two-level parallelizatio. Make a new copy of the default SLURM template and set parameters in the new copy.
file.copy(system.file("extdata/cocluster", "slurm.tmpl", package="spatialHeatmap"), './slurm.tmpl')
The first- and second-level parallelizations are set 3 and 2 cpu cores respectively. The wk.dir is the same in secondary filtering. Note the settings of spd.set (sim/sim.p/dim) has wider ranges than in last section. The parallel computation is performed at High-Performance Computing Center (HPCC) at University of California, Riverside.
opt <- coclus_opt(wk.dir='opt_real_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.8, by=0.1), sim.p=seq(0.2, 0.8, by=0.1), dim=seq(5, 40, by=1), df.match=df.match.arab, batch.par=BatchtoolsParam(workers=3, cluster="slurm", template='slurm.tmpl', RNGseed=100, stop.on.error=FALSE, log=TRUE, logdir=file.path('opt_res', 'batch_log')), multi.core.par=MulticoreParam(workers=2), verbose=FALSE)
If no cluster scheduler is not available, optimization can be parallelized only at the second-level by setting batch.par=NULL.
opt <- coclus_opt(wk.dir='opt_real_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.8, by=0.1), sim.p=seq(0.2, 0.8, by=0.1), dim=seq(5, 40, by=1), df.match=df.match.arab, batch.par=NULL, multi.core.par=MulticoreParam(workers=2))
The performace of each combination of parameter settings on each single cell data set is measured by an AUC value in ROC curve. These AUCs are filtered according to a cutoff (aucs over 0.5) and corresponding total bulk assignments (total.min) and total true assignments (true.min). The following demonstrates how to visualize the AUCs and select optimal parameter settings.
Extract AUCs for each filtering settings across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.fil <- auc_stat(wk.dir='opt_real_res', tar.par='filter', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each filtering settings and AUC cutoff.
df.lis.fil$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.fil[[1]], bar.width=0.07, title='Mean AUCs by filtering settings')
Figure 12: Mean AUCs of filtering settings in real optimization
A bar refers to mean AUCs of a filtering settings at a certain AUC cutoff.
All AUCs by each filtering settings and AUC cutoff.
auc_violin(df.lis=df.lis.fil, xlab='Filtering settings')
Figure 13: All AUCs of filtering settings in real optimization
A violin plot refers to all AUCs of a filtering settings at a AUC cutoff.
According to the mean AUCs, fil1, fil2, and fil3 are selected as optimal filtering settings.
df.par.fil[c(1, 2, 3), ]
## p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100 1 0.10 0.01
## 2 0.2 1 0.2 100 1 0.25 0.05
## 3 0.3 1 0.3 100 1 0.30 0.10
Extract AUCs for normalization methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.norm <- auc_stat(wk.dir='opt_real_res', tar.par='norm', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each normalization method and AUC cutoff.
df.lis.norm$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.norm[[1]], bar.width=0.07, title='Mean AUCs by normalization methods')
All AUCs by each normalization method and AUC cutoff.
auc_violin(df.lis=df.lis.norm, xlab='Normalization methods')
Optimal normalization method: fct (computeSumFactors).
Extract AUCs for graph-building methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.graph <- auc_stat(wk.dir='opt_real_res', tar.par='graph', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each graph-building method and AUC cutoff.
df.lis.graph$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.graph[[1]], bar.width=0.07, title='Mean AUCs by graph-building methods')
All AUCs by each graph-building method and AUC cutoff.
auc_violin(df.lis=df.lis.graph, xlab='Graph-building methods')
Since knn (buildKNNGraph) and snn (buildSNNGraph) have similar mean AUCs, both are selected as optimal graph-building methods.
Extract AUCs for dimensionality reduction methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.dimred <- auc_stat(wk.dir='opt_real_res', tar.par='dimred', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
Mean AUCs by each dimensionality reduction method and AUC cutoff.
df.lis.dimred$df.auc.mean[1:3, ]
# Mean AUCs by each dimensionality reduction method and AUC cutoff.
mean_auc_bar(df.lis.dimred[[1]], bar.width=0.07, title='Mean AUCs by dimensionality reduction methods')
All AUCs by each dimensionality reduction method and AUC cutoff.
auc_violin(df.lis=df.lis.dimred, xlab='Dimensionality reduction')
Optimal dimensionality reduction method: pca (denoisePCA).
Extract AUCs for spd.set across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
df.lis.spd <- auc_stat(wk.dir='opt_real_res', tar.par='spd.set', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
df.lis.spd$auc0.5$df.frq[1:3, ]
All AUCs of top five spd.sets ranked by frequency across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.
spd_auc_violin(df.lis=df.lis.spd, n=5, xlab='spd.sets', x.vjust=0.6)
Top five spd.sets across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively are taken as optimal spd.sets. s, p, d stands for sim, sim.p, dim respectively. E.g. s0.2p0.5d12 means sim = 0.2, sim.p = 0.5, dim = 12.
n <- 5; df.spd.opt <- NULL
for (i in df.lis.spd) {
df.spd.opt <- rbind(df.spd.opt, i$df.frq[seq_len(n), c('sim', 'sim.p', 'dim')])
}
df.spd.opt$spd.set <- paste0('s', df.spd.opt$sim, 'p', df.spd.opt$sim.p, 'd', df.spd.opt$dim)
df.spd.opt <- subset(df.spd.opt, !duplicated(spd.set))
df.spd.opt[1:3, ]
## sim sim.p dim spd.set
## 1 0.2 0.2 5 s0.2p0.2d5
## 22 0.2 0.5 5 s0.2p0.5d5
## 57 0.2 0.3 6 s0.2p0.3d6
The optimal parameter settings at this stage are listed in the table below.
| normalization | filtering.set | dimensionality.reduction | graph.building | spd.set |
|---|---|---|---|---|
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.2d5 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.5d5 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.3d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.6d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.5d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.2d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.3d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.4p0.2d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.3d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.7d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.8d13 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.8d10 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.5d16 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d14 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.4p0.2d13 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d12 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d13 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.2d17 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.7d17 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.4p0.2d12 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.7d16 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.2d23 |
Next, these optimal settings are validated on mouse brain, mouse kindney, and Arabidopsis root data sets. In mouse brain, the bulk RNA-seq data are generated in a research on the impact of placental endocrine on mouse cerebellar development (Vacher et al. 2021) and the scRNA-seq data are from a study of mouse brain molecular atlas (Ortiz et al. 2020). The bulk count data are produced using systemPipeR (2.1.12) (Backman and Girke 2016). Details about how to access and format bulk and single data are described here. In the following, blk.mus.brain and sc.mus.brain refers to bulk and single cell data respectively. The validation is performed by applying these optimal settings on the same coclustering workflow, so the following procedures are not detailed.
Initial filtering.
blk.mus.brain <- filter_data(data=blk.mus.brain, pOA=c(0.05, 5), CV=c(0.05, 100))
mus.brain.lis <- filter_cell(lis=list(sc.mus=sc.mus.brain), bulk=blk.mus.brain, gen.rm=NULL, min.cnt=1, p.in.cell=0.01, p.in.gen=0.05, verbose=FALSE)
Bulk and single cell are combined and normalized, then separated.
mus.brain.lis.nor <- norm_multi(dat.lis=mus.brain.lis, cpm=FALSE)
Secondary filtering. Since fil1 and fil2 exhibit similar performaces, only fil1 is used.
blk.mus.brain.fil <- filter_data(data=mus.brain.lis.nor$bulk, pOA=c(0.1, 1), CV=c(0.1, 100), verbose=FALSE)
mus.brain.lis.fil <- filter_cell(lis=list(sc.mus=mus.brain.lis.nor$sc.mus), bulk=blk.mus.brain.fil, gen.rm=NULL, min.cnt=1, p.in.cell=0.1, p.in.gen=0.01, verbose=FALSE)
Matching table indicating true bulk tissues of each cell type and corresponding SVG bulk (spatial feature).
match.mus.brain.pa <- system.file("extdata/shinyApp/example", "match_mouse_brain_cocluster.txt", package="spatialHeatmap")
df.match.mus.brain <- read.table(match.mus.brain.pa, header=TRUE, row.names=1, sep='\t')
df.match.mus.brain
## SVGBulk cell trueBulk
## 1 cerebellum cere CERE
## 2 hippocampus hipp HIPP
## 3 cerebral.cortex isocort CERE.CORTEX
## 4 hippocampus retrohipp HIPP
## 5 hypothalamus hypotha HYPOTHA
Since knn and snn display similar performances, only knn is used. All optimal spd.set settings in Table 1 are tested, and results are shown in Figure 14a.
mus.brain.df.para <- cocluster(bulk=mus.brain.lis.fil$bulk, cell=mus.brain.lis.fil$sc.mus, df.match=df.match.mus.brain, df.para=df.spd.opt[, c('sim', 'sim.p', 'dim')], graph.meth='knn', dimred='PCA', return.all=FALSE, multi.core.par=MulticoreParam(workers=2))
In mouse kidney, four bulk tissues are selected: proximal straight tubule in cortical medullary rays (PTS2), cortical collecting duct (CCD), and cortical thick ascending limb of the loop of Henle (cTAL), glomerulus. PTS2 data are from a research on cell-type selective markers in mouse kidney (Clark et al. 2019), CCD and cTAL are from transcriptome analysis of major renal collecting duct cell types in mouse kidney (Chen et al. 2017), and glomerulus is from a transcriptome atlas study of mouse glomerulus (Karaiskos et al. 2018). The FASTQ files of the four tissues are downloaded from original studies and raw count data are generated with systemPipeR (2.1.12) (Backman and Girke 2016). The single cell data are accessed from an investigation in cellular targets of mouse kidney metabolic acidosis (Park et al. 2018). Details about how to access and format bulk and single data are described here.
The validating procedures on mouse kindey are same with mouse brain except that after initial filtering replicates in each bulk are reduced to 3 by using function reduce_rep due to two many replicates. The results are shown in Figure 14b.
In Arabidopsis root, the same bulk tissues (Li et al. 2016) and two additional single cell data sets (sc9, sc51, (Shahan et al. 2020)) from the same studies as real optimization are used. Details about how to access and format them are described here. The procedures of validating optimized settings are the same with mouse brain except that in normalization two single cell data sets are used instead of one. The results are shown in Figure 14c and d.
As comparisons, random combinations of non-optimal settings are generated and tested. In filtering, fil4 is regarded non-optimal, but it filters out too many genes so that the coclustering procedures cannot run. Thus fil3 is used in the random settings. The graph-building methods have two settings knn and snn, and both are taken as optimal, thus they are all used for generating random combinations. See details here.
df.par.rdn <- random_para(fil.set=c('fil3'), norm='cpm', dimred='UMAP', graph.meth=c('knn', 'snn'), sim=round(seq(0.2, 0.8, by=0.1), 1), sim.p=round(seq(0.2, 0.8,by=0.1), 1), dim=seq(5, 40, by=1), df.spd.opt=df.spd.opt)
df.par.rdn[1:3, ]
These random settings are tested on each of the four validating data sets, where other settings such as initial filtering are not changed. The results are shown in Figure 15c and d.
The AUCs of optimal and random settings are presented in Figure 14 and Figure 15 respectively. In both figures, if total bulk assignments < 500 or total true assignments < 300 or AUC < 0.5, AUCs are set 0. It is clear that the optimal settings exhibit better performance than random settings, so the optimization workflow and results are reliable to some extent. In Figure 14, asterisks indicate optimal settings have AUCs >= 0.5, total bulk assignments >= 500, and total true assignments >= 300 across all four data sets. These settings are regarded as final optimal settings (Table 2).
Figure 14: Validating optimal settings
AUCs of optimal settings on each validating data sets. A bar refers to one AUC of one optimal settings. Asterisks indicate optimal settings have AUC >= 0.5, total bulk assignments >= 500, and total true assignments >= 300 across all four data sets. If total bulk assignments < 500 or total true assignments < 300 or AUC < 0.5, AUCs are set 0. (a) Mouse brain. (b) Mouse kidndy. (c) Arabidopsis root of single cell 9. (d) Arabidopsis root of single cell 51.
Figure 15: Random settings
AUCs of random settings on each validating data sets. A bar refers to one AUC of one random settings. If total bulk assignments < 500 or total true assignments < 300 or AUC < 0.5, AUCs are set 0. (a) Mouse brain. (b) Mouse kidndy. (c) Arabidopsis root of single cell 9. (d) Arabidopsis root of single cell 51.
| normalization | filtering.set | dimensionality.reduction | graph.building | spd.set |
|---|---|---|---|---|
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.5d5 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.6d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.5d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.2d6 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.3d7 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.8d13 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.5d16 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d14 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d12 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.4d13 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.2d17 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.3p0.7d17 |
| fct | fil1, fil2, fil3 | pca | knn, snn | s0.2p0.2d23 |
This section illustrates individual steps in the application of automatic method on the mouse brain data.
The aSVG file of mouse brain.
svg.mus.brain <- system.file("extdata/shinyApp/example", "mus_musculus.brain.svg", package="spatialHeatmap")
# Spatial features.
feature.df <- return_feature(svg.path=svg.mus.brain)
The true matching between bulk tissues and single cells are defined in a table and provided to the co-clustering workflow so that the performance can be assessed. In addition, the matching between tissues and aSVG features (SVGBulk) are defined in the same table, which is used to create the co-visualization plot.
match.mus.brain.pa <- system.file("extdata/shinyApp/example", "match_mouse_brain_cocluster.txt", package="spatialHeatmap")
df.match.mus.brain <- read.table(match.mus.brain.pa, header=TRUE, row.names=1, sep='\t')
df.match.mus.brain
Ensure the SVGBulk tissues are in the aSVG file.
df.match.mus.brain$SVGBulk %in% feature.df$feature
Before co-clustered with bulk tissues, single cells are clustered and resultant clusters are refined to obtain more homogeneous single cell clusters (Figure 5.5-6) . Cluster labels are stored in the cluster column in colData.
clus.sc <- read_cache(cache.pa, 'clus.sc') # Retrieve data from cache.
if (is.null(clus.sc)) {
clus.sc <- cluster_cell(data=mus.lis.fil$sc.mus, min.dim=10, max.dim=50, graph.meth='knn', dimred='PCA')
save_cache(dir=cache.pa, overwrite=TRUE, clus.sc)
}
colData(clus.sc)[1:3, ]
Refine cell clusters.
cell.refined <- refine_cluster(clus.sc, sim=0.2, sim.p=0.8, sim.meth='spearman', verbose=FALSE)
Include matching information in colData.
cell.refined <- true_bulk(cell.refined, df.match.mus.brain)
colData(cell.refined)[1:3, ]
Co-cluster bulk and single cells.
roc.lis <- read_cache(cache.pa, 'roc.lis') # Retrieve data from cache.
if (is.null(roc.lis)) {
roc.lis <- coclus_roc(bulk=mus.lis.fil$bulk, cell.refined=cell.refined, df.match=df.match.mus.brain, min.dim=13, graph.meth='knn', dimred='PCA')
save_cache(dir=cache.pa, overwrite=TRUE, roc.lis)
}
The auto-matching results are listed in roc.lis$df.roc. predictor is the similarity (Spearman’s or Pearson’s correlation coefficient) between bulk and cells within a co-cluster, which is used to assign bulk tissues to cells (Figure 5.8). response indicates whether the bulk assignment is correct according to the matching table. index is the cell index in the SingleCellExperiment after cell clusters are refined.
roc.lis$df.roc[1:3, ]
table(roc.lis$df.roc$response)
Incorporate cell.refined in roc.lis for downstream co-visualization.
res.lis <- c(list(cell.refined=cell.refined), roc.lis)
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 11 (bullseye)
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocParallel_1.30.0 igraph_1.3.1 scater_1.24.0 ggplot2_3.3.6 scran_1.24.0 scuttle_1.6.2 SingleCellExperiment_1.18.0
## [8] SummarizedExperiment_1.26.1 Biobase_2.56.0 GenomicRanges_1.48.0 GenomeInfoDb_1.32.1 IRanges_2.30.0 S4Vectors_0.34.0 BiocGenerics_0.42.0
## [15] MatrixGenerics_1.8.0 matrixStats_0.62.0 spatialHeatmap_2.1.3 knitr_1.39 BiocStyle_2.24.0 nvimcom_0.9-128 colorout_1.2-2
##
## loaded via a namespace (and not attached):
## [1] shinydashboard_0.7.2 utf8_1.2.2 tidyselect_1.1.2 RSQLite_2.2.14 AnnotationDbi_1.58.0 htmlwidgets_1.5.4 grid_4.2.0 Rtsne_0.16
## [9] pROC_1.18.0 munsell_0.5.0 ScaledMatrix_1.4.0 codetools_0.2-18 preprocessCore_1.58.0 statmod_1.4.36 av_0.7.0 withr_2.5.0
## [17] colorspace_2.0-3 filelock_1.0.2 highr_0.9 rstudioapi_0.13 labeling_0.4.2 GenomeInfoDbData_1.2.8 farver_2.1.0 bit64_4.0.5
## [25] distinct_1.8.0 rhdf5_2.40.0 vctrs_0.4.1 generics_0.1.2 rols_2.24.2 xfun_0.30 BiocFileCache_2.4.0 fastcluster_1.2.3
## [33] R6_2.5.1 doParallel_1.0.17 ggbeeswarm_0.6.0 rsvd_1.0.5 locfit_1.5-9.5 rsvg_2.3.1 bitops_1.0-7 rhdf5filters_1.8.0
## [41] cachem_1.0.6 gridGraphics_0.5-1 DelayedArray_0.22.0 assertthat_0.2.1 promises_1.2.0.1 scales_1.2.0 nnet_7.3-17 beeswarm_0.4.0
## [49] gtable_0.3.0 beachmat_2.12.0 WGCNA_1.71 rlang_1.0.2 genefilter_1.78.0 splines_4.2.0 lazyeval_0.2.2 impute_1.70.0
## [57] checkmate_2.1.0 BiocManager_1.30.17 yaml_2.3.5 reshape2_1.4.4 backports_1.4.1 httpuv_1.6.5 Hmisc_4.7-0 tools_4.2.0
## [65] bookdown_0.26 ggplotify_0.1.0 ellipsis_0.3.2 gplots_3.1.3 jquerylib_0.1.4 RColorBrewer_1.1-3 ggdendro_0.1.23 dynamicTreeCut_1.63-1
## [73] Rcpp_1.0.8.3 plyr_1.8.7 visNetwork_2.1.0 base64enc_0.1-3 sparseMatrixStats_1.8.0 progress_1.2.2 zlibbioc_1.42.0 purrr_0.3.4
## [81] RCurl_1.98-1.6 prettyunits_1.1.1 rpart_4.1.16 viridis_0.6.2 cowplot_1.1.1 ggrepel_0.9.1 cluster_2.1.3 magrittr_2.0.3
## [89] RSpectra_0.16-1 data.table_1.14.2 magick_2.7.3 grImport_0.9-5 mime_0.12 hms_1.1.1 evaluate_0.15 xtable_1.8-4
## [97] XML_3.99-0.9 jpeg_0.1-9 gridExtra_2.3 compiler_4.2.0 tibble_3.1.7 KernSmooth_2.23-20 crayon_1.5.1 htmltools_0.5.2
## [105] later_1.3.0 Formula_1.2-4 tidyr_1.2.0 geneplotter_1.74.0 DBI_1.1.2 dbplyr_2.1.1 MASS_7.3-57 rappdirs_0.3.3
## [113] Matrix_1.4-1 cli_3.3.0 metapod_1.4.0 parallel_4.2.0 pkgconfig_2.0.3 flashClust_1.01-2 foreign_0.8-82 plotly_4.10.0
## [121] xml2_1.3.3 foreach_1.5.2 annotate_1.74.0 vipor_0.4.5 bslib_0.3.1 dqrng_0.3.0 rngtools_1.5.2 XVector_0.36.0
## [129] doRNG_1.8.2 yulab.utils_0.0.4 stringr_1.4.0 digest_0.6.29 Biostrings_2.64.0 rmarkdown_2.14 htmlTable_2.4.0 uwot_0.1.11
## [137] edgeR_3.38.1 DelayedMatrixStats_1.18.0 curl_4.3.2 shiny_1.7.1 gtools_3.9.2 lifecycle_1.0.1 jsonlite_1.8.0 Rhdf5lib_1.18.0
## [145] BiocNeighbors_1.14.0 viridisLite_0.4.0 limma_3.52.0 fansi_1.0.3 pillar_1.7.0 lattice_0.20-45 KEGGREST_1.36.0 fastmap_1.1.0
## [153] httr_1.4.3 survival_3.3-1 GO.db_3.15.0 glue_1.6.2 FNN_1.1.3 UpSetR_1.4.0 png_0.1-7 iterators_1.0.14
## [161] bluster_1.6.0 bit_4.0.4 stringi_1.7.6 sass_0.4.1 HDF5Array_1.24.0 blob_1.2.3 DESeq2_1.36.0 BiocSingular_1.12.0
## [169] latticeExtra_0.6-29 caTools_1.18.2 memoise_2.0.1 dplyr_1.0.9 irlba_2.3.5
This project has been funded by NSF awards: PGRP-1546879, PGRP-1810468, PGRP-1936492.
Backman, Tyler W. H., and Thomas Girke. 2016. “SystemPipeR: NGS Workflow and Report Generation Environment.” BMC Bioinformatics 17 (1). https://doi.org/10.1186/s12859-016-1241-0.
Chang, Winston, and Barbara Borges Ribeiro. 2018. Shinydashboard: Create Dashboards with ’Shiny’. https://CRAN.R-project.org/package=shinydashboard.
Chang, Winston, Joe Cheng, JJ Allaire, Cars on Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2021. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.
Chen, Lihe, Jae Wook Lee, Chung-Lin Chou, Anil V Nair, Maria A Battistone, Teodor G Păunescu, Maria Merkulova, et al. 2017. “Transcriptomes of Major Renal Collecting Duct Cell Types in Mouse Identified by Single-Cell RNA-seq.” Proc. Natl. Acad. Sci. U. S. A. 114 (46): E9989–E9998.
Clark, Jevin Z, Lihe Chen, Chung-Lin Chou, Hyun Jun Jung, Jae Wook Lee, and Mark A Knepper. 2019. “Representation and Relative Abundance of Cell-Type Selective Markers in Whole-Kidney RNA-Seq Data.” Kidney Int. 95 (4): 787–96.
Karaiskos, Nikos, Mahdieh Rahmatollahi, Anastasiya Boltengagen, Haiyue Liu, Martin Hoehne, Markus Rinschen, Bernhard Schermer, et al. 2018. “A Single-Cell Transcriptome Atlas of the Mouse Glomerulus.” J. Am. Soc. Nephrol. 29 (8): 2060–8.
Li, Song, Masashi Yamada, Xinwei Han, Uwe Ohler, and Philip N Benfey. 2016. “High-Resolution Expression Map of the Arabidopsis Root Reveals Alternative Splicing and lincRNA Regulation.” Dev. Cell 39 (4): 508–22.
Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell Rna-Seq Data with Bioconductor.” F1000Res. 5: 2122. https://doi.org/10.12688/f1000research.9501.2.
Marques, Sueli, Amit Zeisel, Simone Codeluppi, David van Bruggen, Ana Mendanha Falcão, Lin Xiao, Huiliang Li, et al. 2016. “Oligodendrocyte Heterogeneity in the Mouse Juvenile and Adult Central Nervous System.” Science 352 (6291): 1326–9.
Morgan, Martin, Jiefei Wang, Valerie Obenchain, Michel Lang, Ryan Thompson, and Nitesh Turaga. 2021. BiocParallel: Bioconductor Facilities for Parallel Evaluation. https://github.com/Bioconductor/BiocParallel.
Ortiz, Cantin, Jose Fernandez Navarro, Aleksandra Jurek, Antje Märtin, Joakim Lundeberg, and Konstantinos Meletis. 2020. “Molecular Atlas of the Adult Mouse Brain.” Science Advances 6 (26): eabb3446.
Park, Jihwan, Rojesh Shrestha, Chengxiang Qiu, Ayano Kondo, Shizheng Huang, Max Werth, Mingyao Li, Jonathan Barasch, and Katalin Suszták. 2018. “Single-Cell Transcriptomics of the Mouse Kidney Reveals Potential Cellular Targets of Kidney Disease.” Science 360 (6390): 758–63.
Shahan, Rachel, Che-Wei Hsu, Trevor M Nolan, Benjamin J Cole, Isaiah W Taylor, Anna Hendrika Cornelia Vlot, Philip N Benfey, and Uwe Ohler. 2020. “A Single Cell Arabidopsis Root Atlas Reveals Developmental Trajectories in Wild Type and Cell Identity Mutants.” bioRxiv.
Vacher, Claire-Marie, Helene Lacaille, Jiaqi J O’Reilly, Jacquelyn Salzbank, Dana Bakalar, Sonia Sebaoui, Philippe Liere, et al. 2021. “Placental Endocrine Function Shapes Cerebellar Development and Social Behavior.” Nat. Neurosci. 24 (10): 1392–1401.